VC-Dimension

Definition (Shattering)

A hypothesis class $H$ shatters a finite set $C \subset X$ if the restriction of $H$ to $C$ is the set of all functions from $C$ to $\{0, 1\}$. That is, $|H_C| = 2^{|C|}$.
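
A shattering check is easy to brute-force for small finite classes. Below is a minimal Python sketch (the class of thresholds is an illustrative choice, not from the notes): it collects the restriction of each hypothesis to $C$ and compares it against all $2^{|C|}$ labelings.

```python
from itertools import product

def shatters(hypotheses, C):
    """Check whether a (finite) hypothesis class shatters the set C.

    hypotheses: iterable of callables X -> {0, 1}
    C: finite list of points from X
    Returns True iff the restrictions of the hypotheses to C
    realize all 2^|C| labelings.
    """
    restrictions = {tuple(h(x) for x in C) for h in hypotheses}
    return restrictions == set(product([0, 1], repeat=len(C)))

# Thresholds h_a(x) = 1[x <= a] on a small grid: they shatter any
# single point (with both labels realizable) but no pair, since the
# labeling (0, 1) with x1 < x2 is unrealizable.
thresholds = [lambda x, a=a: int(x <= a) for a in range(-1, 4)]
print(shatters(thresholds, [0]))     # True
print(shatters(thresholds, [1, 2]))  # False
```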

Corollary

Let $H$ be a hypothesis class of functions $X \to \{0, 1\}$. Let $m$ be a training set size. Assume that there exists a set $C \subset X$ of size $2m$ that is shattered by $H$. Then, for any learning algorithm $A$, there exist a distribution $D$ over $X \times \{0, 1\}$ and a predictor $h \in H$ such that $L_D(h) = 0$, but with probability of at least $1/7$ over the choice of $S \sim D^m$ we have that $L_D(A(S)) \ge 1/8$.
Philosophically, if someone can explain every phenomenon, their explanations are worthless.

Definition (VC Dimension)

The VC-dimension of a hypothesis class $H$, denoted $VCdim(H)$, is the maximal size of a set $C \subset X$ that can be shattered by $H$. If $H$ can shatter sets of arbitrarily large size, we say that $H$ has infinite VC-dimension.

Theorem

Let $H$ be a class of infinite VC-dimension. Then $H$ is not PAC learnable.

Proof

Since $H$ has infinite VC-dimension, for any training set size $m$ there exists a shattered set of size $2m$, and the claim follows from the previous corollary.

How to show $VCdim(H) = d$? Two things must be established (see the brute-force sketch after this list):

  1. There exists a set $C$ of size $d$ that is shattered by $H$.
  2. No set $C$ of size $d+1$ is shattered by $H$.
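
For small finite classes over a finite domain, both conditions can be checked exhaustively. A minimal sketch (illustrative helper names; reuses `shatters` and `thresholds` from the earlier sketch):

```python
from itertools import combinations

def vc_dimension(hypotheses, domain):
    """Brute-force VC-dimension of a finite class over a finite domain.

    Returns the largest d such that some subset C of the domain with
    |C| = d is shattered (condition 1); stopping at the first size
    with no shattered subset also certifies condition 2.
    """
    d = 0
    for size in range(1, len(domain) + 1):
        if any(shatters(hypotheses, list(C))
               for C in combinations(domain, size)):
            d = size   # condition 1 holds at this size
        else:
            break      # condition 2: no set of size d+1 is shattered
    return d

print(vc_dimension(thresholds, range(-1, 4)))  # 1 for thresholds
```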

Example for finite classes

For any finite hypothesis class $H$, we have $VCdim(H) \le \log_2{|H|}$. This is because for any set $C$, $|H_C| \le |H|$. Therefore, if $2^{|C|} > |H|$, then we cannot shatter $C$.

For example, the class of thresholds $h_a(x) = \mathbb{1}_{[x \le a]}$ over $\mathbb{R}$ has $VCdim = 1$: any single point is shattered, but for $x_1 < x_2$ the labeling $(0, 1)$ is unrealizable. Similarly, the class of intervals $h_{a,b}(x) = \mathbb{1}_{[a \le x \le b]}$ has $VCdim = 2$, since for $x_1 < x_2 < x_3$ the labeling $(1, 0, 1)$ is unrealizable.

Definition (Growth Function)

Let $H$ be a hypothesis class. Then the growth function of $H$, denoted $\tau_H : \mathbb{N} \to \mathbb{N}$, is defined as

\begin{gather*} \tau_H(m)=\max_{C \subset X:|C|=m} |H_C|. \end{gather*}
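
As a worked example (a standard one, not in the original notes): for thresholds over $\mathbb{R}$, a set of $m$ distinct points admits exactly the $m+1$ restrictions $(1,\ldots,1,0,\ldots,0)$, so

\begin{gather*} \tau_H(m) = m + 1, \end{gather*}

which exactly matches Sauer's bound below with $d = 1$: $\sum_{i=0}^{1}\binom{m}{i} = m+1$.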

Sauer's Lemma

Let $H$ be a hypothesis class with $VCdim(H) \le d < \infty$. Then, for all $m$, $\tau_H(m) \le \sum_{i=0}^d {m \choose i}$. In particular, if $m > d + 1$ then $\tau_H(m) \le (em/d)^d$.

Proof

The first inequality is the combinatorial core of the lemma and is proved by induction on $m$; here we show how the second claim follows from it. Write $k = m$ and note that $k/d > 1$. For $0 \le j \le d$,

\begin{align*} {k \choose j} & = {k! \over j!(k-j)!} \\ & = {k(k-1)\cdots (k-j+1) \over j!} \\ & = {k \over d}{k-1 \over d} \cdots {k-j+1 \over d}{d^j \over j!} \\ & < \left({k \over d}\right)^j {d^j \over j!} \\ & \le \left({k \over d}\right)^d {d^j \over j!} \end{align*}

Hence,

\begin{align*} \sum_{j=0}^d {k \choose j} & \le \sum_{j=0}^d \left({k \over d}\right)^d {d^j \over j!} \\ & \le \left({k \over d}\right)^d \sum_{j=0}^\infty {d^j \over j!} \\ & = \left({ek \over d}\right)^d \end{align*}
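
A quick numeric sanity check of the two bounds (an illustrative Python sketch, not part of the notes):

```python
from math import comb, e

def sauer_sum(m, d):
    """Exact Sauer bound: sum of binomial coefficients up to d."""
    return sum(comb(m, i) for i in range(d + 1))

# For m > d + 1 the polynomial bound (em/d)^d dominates the sum.
for d, m in [(3, 10), (5, 50), (10, 200)]:
    print(d, m, sauer_sum(m, d), (e * m / d) ** d)
```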

Theorem

Let $H$ be a class and let $\tau_H$ be its growth function. Then, for every $D$ and every $\delta \in (0, 1)$, with probability of at least $1-\delta$ over the choice of $S \sim D^m$ we have, for every $h \in H$,

\begin{gather*} |L_D(h) - L_S(h)| \le {4+\sqrt{\log{(\tau_H(2m))}} \over \delta\sqrt{2m}}. \end{gather*}

If $VCdim(H) = d$, then by Sauer's Lemma $\tau_H(2m) \le (2em/d)^d$, so $\sqrt{\log{(\tau_H(2m))}} \le \sqrt{d\log{(2em/d)}}$. For simplicity, assume that $\sqrt{d\log{(2em/d)}} \ge 4$; hence,

\begin{gather*} |L_D(h) - L_S(h)| \le {1 \over \delta} \sqrt{{2d\log{(2em/d)} \over m}}. \end{gather*}
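
To get a feel for the rate, the bound is easy to tabulate (an illustrative sketch; `d`, `m`, and `delta` are just example values):

```python
from math import log, sqrt, e

def vc_generalization_bound(d, m, delta):
    """Evaluate (1/delta) * sqrt(2 d log(2em/d) / m)."""
    return (1 / delta) * sqrt(2 * d * log(2 * e * m / d) / m)

# The gap shrinks roughly like sqrt(d log(m) / m); the 1/delta
# factor comes from Markov's inequality in the proof below.
for m in [1_000, 10_000, 100_000]:
    print(m, vc_generalization_bound(d=10, m=m, delta=0.05))
```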

Proof

We will start by showing that

\begin{gather*} \underset{S \sim \mathcal{D}^m}{\mathbb{E}}\left[\sup_{h \in \mathcal{H}}\left|L_{\mathcal{D}}(h)-L_S(h)\right|\right] \leq \frac{4+\sqrt{\log \left(\tau_{\mathcal{H}}(2 m)\right)}}{\sqrt{2 m}}. \end{gather*}

Since the random variable $\sup_{h \in \mathcal{H}}\left|L_{\mathcal{D}}(h)-L_S(h)\right|$ is nonnegative, the theorem follows directly from this bound by Markov's inequality.

To bound the left-hand side of the previous equation, we first note that for every $h \in \mathcal{H}$ we can rewrite $L_{\mathcal{D}}(h)=\mathbb{E}_{S^{\prime} \sim \mathcal{D}^m}\left[L_{S^{\prime}}(h)\right]$, where $S^{\prime}=z_1^{\prime}, \ldots, z_m^{\prime}$ is an additional i.i.d. sample. Therefore,

\begin{gather*} \underset{S \sim \mathcal{D}^m}{\mathbb{E}}\left[\sup_{h \in \mathcal{H}}\left|L_{\mathcal{D}}(h)-L_S(h)\right|\right]=\underset{S \sim \mathcal{D}^m}{\mathbb{E}}\left[\sup_{h \in \mathcal{H}}\left|\underset{S^{\prime} \sim \mathcal{D}^m}{\mathbb{E}}\left[ L_{S^{\prime}}(h)\right]-L_S(h)\right|\right]. \end{gather*}

A generalization of the triangle inequality yields

\begin{gather*} \left|\underset{S^{\prime} \sim \mathcal{D}^m}{\mathbb{E}}\left[L_{S^{\prime}}(h)-L_S(h)\right]\right| \leq \underset{S^{\prime} \sim \mathcal{D}^m}{\mathbb{E}}\left|L_{S^{\prime}}(h)-L_S(h)\right|, \end{gather*}

and the fact that the supremum of an expectation is smaller than the expectation of the supremum yields

\begin{gather*} \sup_{h \in \mathcal{H}} \underset{S^{\prime} \sim \mathcal{D}^m}{\mathbb{E}}\left|L_{S^{\prime}}(h)-L_S(h)\right| \leq \underset{S^{\prime} \sim \mathcal{D}^m}{\mathbb{E}} \sup_{h \in \mathcal{H}}\left|L_{S^{\prime}}(h)-L_S(h)\right|. \end{gather*}

Formally, the previous two inequalities follow from Jensen's inequality (applied to the convex functions $|\cdot|$ and $\sup$, respectively). Combining all, we obtain

\begin{aligned} \underset{S \sim \mathcal{D}^m}{\mathbb{E}}\left[\sup_{h \in \mathcal{H}}\left|L_{\mathcal{D}}(h)-L_S(h)\right|\right] & \leq \underset{S, S^{\prime} \sim \mathcal{D}^m}{\mathbb{E}}\left[\sup_{h \in \mathcal{H}}\left|L_{S^{\prime}}(h)-L_S(h)\right|\right] \\ &=\underset{S, S^{\prime} \sim \mathcal{D}^m}{\mathbb{E}}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m\left(\ell\left(h, z_i^{\prime}\right)-\ell\left(h, z_i\right)\right)\right|\right]. \tag{1} \end{aligned}

The expectation on the right-hand side is over a choice of two i.i.d. samples $S=z_1, \ldots, z_m$ and $S^{\prime}=z_1^{\prime}, \ldots, z_m^{\prime}$. Since all of these $2m$ vectors are chosen i.i.d., nothing will change if we replace the name of the random vector $z_i$ with the name of the random vector $z_i^{\prime}$. If we do so, instead of the term $\left(\ell\left(h, z_i^{\prime}\right)-\ell\left(h, z_i\right)\right)$ in Equation (1) we will have the term $-\left(\ell\left(h, z_i^{\prime}\right)-\ell\left(h, z_i\right)\right)$. It follows that for every $\boldsymbol{\sigma} \in\{\pm 1\}^m$, Equation (1) equals

\begin{gather*} \underset{S, S^{\prime} \sim \mathcal{D}^m}{\mathbb{E}}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell\left(h, z_i^{\prime}\right)-\ell\left(h, z_i\right)\right)\right|\right]. \end{gather*}

Since this holds for every $\boldsymbol{\sigma} \in\{\pm 1\}^m$, it also holds if we sample each component of $\boldsymbol{\sigma}$ uniformly at random from $\{\pm 1\}$; denote this distribution by $U_{\pm}$. Hence, Equation (1) also equals

\begin{gather*} \underset{\boldsymbol{\sigma} \sim U_{\pm}^m}{\mathbb{E}} \underset{S, S^{\prime} \sim \mathcal{D}^m}{\mathbb{E}}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell\left(h, z_i^{\prime}\right)-\ell\left(h, z_i\right)\right)\right|\right], \end{gather*}

and by the linearity of expectation it also equals

\begin{gather*} \underset{S, S^{\prime} \sim \mathcal{D}^m}{\mathbb{E}} \underset{\boldsymbol{\sigma} \sim U_{\pm}^m}{\mathbb{E}}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell\left(h, z_i^{\prime}\right)-\ell\left(h, z_i\right)\right)\right|\right]. \end{gather*}

Next, fix $S$ and $S^{\prime}$, and let $C$ be the set of instances appearing in $S$ and $S^{\prime}$. Then, we can take the supremum only over $h \in \mathcal{H}_C$. Therefore,

\begin{aligned} &\underset{\boldsymbol{\sigma} \sim U_{\pm}^m}{\mathbb{E}}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell\left(h, z_i^{\prime}\right)-\ell\left(h, z_i\right)\right)\right|\right] \\ &=\underset{\boldsymbol{\sigma} \sim U_{\pm}^m}{\mathbb{E}}\left[\max_{h \in \mathcal{H}_C} \frac{1}{m}\left|\sum_{i=1}^m \sigma_i\left(\ell\left(h, z_i^{\prime}\right)-\ell\left(h, z_i\right)\right)\right|\right]. \end{aligned}
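
This last expectation is easy to estimate by sampling $\boldsymbol{\sigma}$. A minimal sketch (illustrative; the matrix `diffs` stands for the fixed values $\ell(h, z_i') - \ell(h, z_i)$, one row per $h \in \mathcal{H}_C$):

```python
import random

def estimate_max_correlation(diffs, n_samples=10_000):
    """Monte Carlo estimate of E_sigma[max_h |(1/m) sum_i sigma_i*diffs[h][i]|]."""
    m = len(diffs[0])
    total = 0.0
    for _ in range(n_samples):
        sigma = [random.choice((-1, 1)) for _ in range(m)]
        total += max(abs(sum(s * d for s, d in zip(sigma, row))) / m
                     for row in diffs)
    return total / n_samples

# Two hypotheses disagreeing with each other on a few of m = 20 pairs.
diffs = [[1] * 5 + [0] * 15, [0] * 10 + [-1] * 5 + [0] * 5]
print(estimate_max_correlation(diffs))
```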

Fix some $h \in \mathcal{H}_C$ and denote $\theta_h=\frac{1}{m} \sum_{i=1}^m \sigma_i\left(\ell\left(h, z_i^{\prime}\right)-\ell\left(h, z_i\right)\right)$. Since $\mathbb{E}\left[\theta_h\right]=0$ and $\theta_h$ is an average of independent variables, each of which takes values in $[-1,1]$, we have by Hoeffding's inequality that for every $\rho>0$,

\begin{gather*} \mathbb{P}\left[\left|\theta_h\right|>\rho\right] \leq 2 \exp \left(-2 m \rho^2\right). \end{gather*}

Applying the union bound over $h \in \mathcal{H}_C$, we obtain that for any $\rho>0$,

\begin{gather*} \mathbb{P}\left[\max_{h \in \mathcal{H}_C}\left|\theta_h\right|>\rho\right] \leq 2\left|\mathcal{H}_C\right| \exp \left(-2 m \rho^2\right). \end{gather*}

Finally, the Lemma below tells us that the preceding implies

\begin{gather*} \mathbb{E}\left[\max_{h \in \mathcal{H}_C}\left|\theta_h\right|\right] \leq \frac{4+\sqrt{\log \left(\left|\mathcal{H}_C\right|\right)}}{\sqrt{2 m}}. \end{gather*}

Combining all of the above with the definition of $\tau_{\mathcal{H}}$ (note that $|C| \le 2m$, so $\left|\mathcal{H}_C\right| \leq \tau_{\mathcal{H}}(2m)$), we have shown that

\begin{gather*} \underset{S \sim \mathcal{D}^m}{\mathbb{E}}\left[\sup_{h \in \mathcal{H}}\left|L_{\mathcal{D}}(h)-L_S(h)\right|\right] \leq \frac{4+\sqrt{\log \left(\tau_{\mathcal{H}}(2 m)\right)}}{\sqrt{2 m}}. \end{gather*}

Lemma

Let $X$ be a random variable and $x^{\prime} \in \mathbb{R}$ a scalar, and assume that there exist $a>0$ and $b \geq e$ such that for all $t \geq 0$ we have $\mathbb{P}\left[\left|X-x^{\prime}\right|>t\right] \leq 2 b e^{-t^2 / a^2}$. Then, $\mathbb{E}\left[\left|X-x^{\prime}\right|\right] \leq a(2+\sqrt{\log (b)})$.
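
To see how this lemma yields the bound used above (a worked instantiation, filling in the constants): take $X = \max_{h \in \mathcal{H}_C}\left|\theta_h\right|$, $x' = 0$, $b = \left|\mathcal{H}_C\right|$, and $a = 1/\sqrt{2m}$, so that the union bound $2\left|\mathcal{H}_C\right| \exp\left(-2m\rho^2\right)$ has exactly the form $2be^{-\rho^2/a^2}$. Assuming $\left|\mathcal{H}_C\right| \ge e$ (otherwise the bound can be verified directly), the lemma gives

\begin{gather*} \mathbb{E}\left[\max_{h \in \mathcal{H}_C}\left|\theta_h\right|\right] \leq \frac{2+\sqrt{\log \left(\left|\mathcal{H}_C\right|\right)}}{\sqrt{2 m}} \leq \frac{4+\sqrt{\log \left(\left|\mathcal{H}_C\right|\right)}}{\sqrt{2 m}}. \end{gather*}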

Proof

For all $i=0,1,2, \ldots$ denote $t_i=a(i+\sqrt{\log (b)})$. Since $t_i$ is monotonically increasing, we have that

\begin{gather*} \mathbb{E}\left[\left|X-x^{\prime}\right|\right] \leq a \sqrt{\log (b)}+\sum_{i=1}^{\infty} t_i \mathbb{P}\left[\left|X-x^{\prime}\right|>t_{i-1}\right]. \end{gather*}

Using the assumption in the lemma we have

\begin{align*} \sum_{i=1}^{\infty} t_i \mathbb{P}\left[\left|X-x^{\prime}\right|>t_{i-1}\right] & \leq 2 a b \sum_{i=1}^{\infty}(i+\sqrt{\log (b)}) e^{-(i-1+\sqrt{\log (b)})^2} \\ & \leq 2 a b \int_{1+\sqrt{\log (b)}}^{\infty} x e^{-(x-1)^2} d x \\ & =2 a b \int_{\sqrt{\log (b)}}^{\infty}(y+1) e^{-y^2} d y \\ & \leq 4 a b \int_{\sqrt{\log (b)}}^{\infty} y e^{-y^2} d y \\ & =2 a b\left[-e^{-y^2}\right]_{\sqrt{\log (b)}}^{\infty} \\ & =2 a b / b=2 a. \end{align*}

Combining the preceding inequalities we conclude our proof.